111 research outputs found

    Leveraging the Performance of LBM-HPC for Large Sizes on GPUs using Ghost Cells

    Today, we are witnessing a growing demand for larger and more efficient computational resources from the scientific community. On the other hand, the appearance of GPUs for general-purpose computing represented an important step towards covering such demand. These devices offer an impressive computational capacity at low cost and with an efficient power consumption. However, the memory available in these devices is sometimes not enough, making computationally expensive memory transfers between CPU and GPU necessary and causing a dramatic fall in performance. Recently, the Lattice-Boltzmann Method has established itself as an efficient methodology for fluid simulations. Although this method presents some interesting features that are particularly amenable to efficient exploitation on parallel computers, it requires a considerable memory capacity, which can be an important drawback, in particular on GPUs. In the present paper, a new GPU-based implementation is proposed which minimizes such requirements with respect to other state-of-the-art implementations. It allows us to execute almost 2x bigger problems without additional memory transfers, achieving faster executions when dealing with large problems.
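
    A back-of-the-envelope sketch (our illustration, not code or data from the paper) of why avoiding a second full copy of the distribution functions, keeping only a thin ghost-cell layer instead, lets roughly twice as large a problem fit in GPU memory, consistent with the "almost 2x" claim above. The D2Q9 layout, double precision and the 16 GB device budget are assumptions made for illustration.

        // Illustrative only: memory footprint of a D2Q9 lattice with one vs. two
        // full copies of the distribution functions. The 16 GB budget is assumed.
        #include <cstdio>

        int main() {
            const double gpu_bytes = 16.0 * (1ull << 30); // assumed GPU memory budget
            const int    q         = 9;                   // D2Q9: 9 distributions per cell
            const double per_cell  = q * sizeof(double);  // bytes per cell, one copy

            // Classic two-lattice ("A-B") scheme keeps two full copies.
            const double cells_two_copies = gpu_bytes / (2.0 * per_cell);
            // A single lattice plus a thin ghost layer needs roughly one copy,
            // so almost twice as many cells fit in the same memory.
            const double cells_one_copy   = gpu_bytes / per_cell;

            std::printf("cells, two copies: %.2e\n", cells_two_copies);
            std::printf("cells, one copy  : %.2e (about %.1fx larger)\n",
                        cells_one_copy, cells_one_copy / cells_two_copies);
        }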

    A Non-uniform Staggered Cartesian Grid approach for Lattice-Boltzmann method

    We propose a numerical approach based on the Lattice-Boltzmann method (LBM) for dealing with mesh refinement on non-uniform staggered Cartesian grids. We explain, in detail, the strategy for mapping LBM over such geometries. The main benefit of this approach, compared to others, consists of solving all fluid units only once per time step, while also considerably reducing the complexity of the communication and memory management between different refinement levels. It also maps better onto parallel processors. To validate our method, we analyze several standard test scenarios, reaching satisfactory results with respect to other state-of-the-art methods. The performance evaluation proves that our approach not only exhibits a simpler and more efficient scheme for dealing with mesh refinement, but also a fast resolution, even in those scenarios where it needs to use a higher number of fluid units.
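
    A minimal, hypothetical data-layout sketch (not the authors' code) of the general idea of keeping several Cartesian grids at different resolutions and advancing every fluid unit exactly once per global time step; the names Level and step_level, the D2Q9 layout and the grid sizes are illustrative assumptions.

        // Hypothetical sketch of a multi-level Cartesian layout for LBM refinement.
        // Each level is a regular grid with its own spacing; every fluid unit is
        // advanced exactly once per time step. Coarse-fine coupling is omitted.
        #include <vector>

        struct Level {
            int nx, ny;             // grid dimensions at this refinement level
            double dx;              // lattice spacing (typically halved per level)
            std::vector<double> f;  // D2Q9 distributions, nx*ny*9, flattened
        };

        void step_level(Level& lv) {
            (void)lv;               // placeholder for collide + stream over one level
        }

        int main() {
            std::vector<Level> levels;
            levels.push_back({128, 128, 1.0, std::vector<double>(128 * 128 * 9)});
            levels.push_back({ 64,  64, 0.5, std::vector<double>( 64 *  64 * 9)}); // refined patch

            for (int t = 0; t < 10; ++t)
                for (auto& lv : levels)   // all levels advanced once per time step
                    step_level(lv);
        }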

    LBM-HPC - An open-source tool for fluid simulations. Case study: Unified parallel C (UPC-PGAS)

    The main motivation of this work is the evaluation of the Unified Parallel C (UPC) model for Boltzmann fluid simulations. UPC is one of the current models in the so-called Partitioned Global Address Space (PGAS) paradigm, which attempts to increase the simplicity of codes while achieving better efficiency and scalability. Two different UPC-based implementations, explicit and implicit, are presented and evaluated. We compare the fundamental features of our UPC implementations with another parallel programming model, hybrid MPI-OpenMP. In particular, each of the major steps of any LBM code, i.e., boundary conditions, communication, and the LBM solver, is analyzed.
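
    Since no UPC source is reproduced here, the following plain C++/threads sketch only illustrates the explicit-versus-implicit distinction mentioned above: either the needed remote (neighbouring-partition) values are copied into private storage before the sweep, analogous to an explicit upc_memget, or they are read through the shared array inside the loop and the runtime moves the data. The stencil, sizes and names are illustrative assumptions, not the paper's implementation.

        // Illustrative only: plain C++ threads stand in for UPC threads.
        #include <thread>
        #include <vector>

        int main() {
            const int nthreads = 4, block = 1024, n = nthreads * block;
            std::vector<double> grid(n, 1.0), next(n, 0.0);

            auto worker = [&](int tid) {
                const int lo = tid * block, hi = lo + block;
                // "Explicit" style: fetch the two remote boundary values once,
                // before the sweep (analogous to an explicit upc_memget).
                const double left  = (lo > 0) ? grid[lo - 1] : 0.0;
                const double right = (hi < n) ? grid[hi]     : 0.0;
                for (int i = lo; i < hi; ++i) {
                    const double l = (i == lo)     ? left  : grid[i - 1];
                    const double r = (i == hi - 1) ? right : grid[i + 1];
                    next[i] = 0.5 * (l + r);  // toy stencil standing in for the LBM update
                    // "Implicit" style would instead read grid[lo - 1] / grid[hi]
                    // directly here and rely on the PGAS runtime to move the data.
                }
            };

            std::vector<std::thread> pool;
            for (int t = 0; t < nthreads; ++t) pool.emplace_back(worker, t);
            for (auto& th : pool) th.join();
        }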

    Multi-domain grid refinement for lattice-Boltzmann simulations on heterogeneous platforms

    The main contribution of the present work consists of several parallel approaches for grid refinement based on a multi-domain decomposition for lattice-Boltzmann simulations. The proposed method for discretizing the fluid incorporates several regular Cartesian grids with non-homogeneous spatial domains, which need to communicate with each other. Three different parallel approaches are proposed: homogeneous multicore, homogeneous GPU, and heterogeneous multicore-GPU. Although the homogeneous implementations exhibit satisfactory results, the heterogeneous approach achieves up to 30% extra efficiency, in terms of Millions of Fluid Lattice Updates per Second (MFLUPS), by overlapping some of the steps on both architectures, multicore and GPU.
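
    A back-of-the-envelope illustration (with assumed timings, not measurements from the paper) of why overlapping the multicore and GPU phases raises the MFLUPS figure: with overlap, the time per step is bounded by the slower device rather than by the sum of both.

        // Illustrative only: MFLUPS with and without CPU/GPU overlap. The lattice
        // size and the per-step times are assumptions, not results from the paper.
        #include <algorithm>
        #include <cstdio>

        int main() {
            const double fluid_nodes = 4.0e6;   // assumed lattice size
            const double t_cpu = 0.012;         // assumed seconds per step, multicore part
            const double t_gpu = 0.020;         // assumed seconds per step, GPU part

            const double t_serial  = t_cpu + t_gpu;           // phases run back to back
            const double t_overlap = std::max(t_cpu, t_gpu);  // phases overlapped

            std::printf("MFLUPS serial : %.1f\n", fluid_nodes / t_serial  / 1e6);
            std::printf("MFLUPS overlap: %.1f\n", fluid_nodes / t_overlap / 1e6);
        }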

    Towards HPC-Embedded Case Study: Kalray and Message-Passing on NoC

    Today, one of the most important challenges in HPC is the development of computers with low power consumption. In this context, new embedded many-core systems have recently emerged. One of them is Kalray. Unlike other many-core architectures, Kalray is not a co-processor (it is self-hosted). One interesting feature of the Kalray architecture is its Network on Chip (NoC) interconnect. Usually, communication in many-core architectures is carried out via shared memory; in Kalray, however, communication among processing elements can also take place via message-passing on the NoC. One of the main motivations of this work is to present the main constraints of dealing with the Kalray architecture. In particular, we focus on memory management and communication, and we assess the use of the NoC and of shared memory on Kalray. Unlike shared memory, the implementation of message-passing on the NoC is not transparent from the programmer's point of view. The synchronization among processing elements and the NoC is another challenge to deal with on the Kalray processor. Although synchronization using message-passing is more complex and time-consuming than using shared memory, we obtain an overall speedup close to 6 when using message-passing on the NoC with respect to the use of shared memory. Additionally, we have measured the power consumption of both approaches. Despite being faster, the use of the NoC presents a higher power consumption than the approach that exploits shared memory; this additional consumption is about 50% in watts. However, the reduction in execution time achieved with the NoC has an important impact on the overall energy consumption as well.
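
    A quick sanity check of the closing point, using only the ratios reported above (roughly a 6x speedup at roughly 50% higher power draw): since energy is power multiplied by time, the NoC version should consume on the order of a quarter of the energy of the shared-memory version. The numbers below are normalised, not measurements.

        // Illustrative only: energy = power x time, with the abstract's ratios.
        #include <cstdio>

        int main() {
            const double t_shared = 1.0, p_shared = 1.0;  // normalised baseline
            const double t_noc = t_shared / 6.0;          // ~6x speedup reported
            const double p_noc = p_shared * 1.5;          // ~50% higher power reported

            const double e_shared = p_shared * t_shared;
            const double e_noc    = p_noc * t_noc;
            std::printf("relative energy, NoC vs shared memory: %.2f\n", e_noc / e_shared);
        }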

    Many-task computing on many-core architectures

    Many-Task Computing (MTC) is a common scenario on multiple parallel systems, such as clusters, grids, clouds and supercomputers, but it is not so popular on shared-memory parallel processors. In this sense, and given the spectacular growth in performance and in the number of cores integrated in many-core architectures, the study of MTC on such architectures is becoming more and more relevant. In this paper, the authors present the programming mechanisms available for taking advantage of such massively parallel features for the particular target of MTC. The hardware features of the two dominant many-core platforms (NVIDIA's GPUs and Intel Xeon Phi) are also analyzed for our specific framework. Given the important differences in terms of hardware and software between our two many-core platforms, we have considered different strategies based on CUDA (for GPUs) and OpenMP (for Intel Xeon Phi). We carried out several test cases based on an appropriate and widely studied benchmarking problem, matrix multiplication. Essentially, this study consisted of comparing the time consumed when computing several tasks one by one (all computational resources are used to compute a single task at a time) with the time consumed when computing the same set of tasks simultaneously (all computational resources are shared by the whole set of tasks at the very same time). Finally, we compared both software-hardware scenarios to identify the most relevant computer features in each of our many-core architectures.
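
    A hypothetical C++ sketch (not the benchmark code; the matrix size and task count are arbitrary) of the two scheduling modes being compared: running the tasks one by one, so each task can use the whole machine, versus launching every task at once and letting the tasks share the cores.

        // Illustrative only: one-task-at-a-time vs. all-tasks-at-once scheduling.
        #include <future>
        #include <vector>

        using Matrix = std::vector<double>;

        Matrix matmul(const Matrix& a, const Matrix& b, int n) {
            Matrix c(n * n, 0.0);
            for (int i = 0; i < n; ++i)
                for (int k = 0; k < n; ++k)
                    for (int j = 0; j < n; ++j)
                        c[i * n + j] += a[i * n + k] * b[k * n + j];
            return c;
        }

        int main() {
            const int n = 128, tasks = 8;
            Matrix a(n * n, 1.0), b(n * n, 2.0);

            // (a) One task at a time: a real implementation would parallelise the
            //     kernel itself (e.g. with OpenMP or CUDA) so each task uses all cores.
            for (int t = 0; t < tasks; ++t) matmul(a, b, n);

            // (b) All tasks at once: each task is sequential; the concurrency comes
            //     from running many independent tasks simultaneously.
            std::vector<std::future<Matrix>> fs;
            for (int t = 0; t < tasks; ++t)
                fs.push_back(std::async(std::launch::async, matmul,
                                        std::cref(a), std::cref(b), n));
            for (auto& f : fs) f.get();
        }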

    Many Neglected Tropical Diseases May Have Originated in the Paleolithic or Before: New Insights from Genetics

    The standard view of modern human infectious diseases is that many of them arose during the Neolithic when animals were first domesticated, or afterwards. Here we review recent genetic and molecular clock estimates that point to a much older Paleolithic origin (2.5 million years ago to 10,000 years ago) of some of these diseases. During part of this ancient period our early human ancestors were still isolated in Africa. We also discuss the need for investigations of the origin of these diseases in African primates and other animals that have been the original source of many neglected tropical diseases.
    • …